Critically Engaging with AI Ethics¶

In this lab we will be critically engaging with existing datasets that have been used to address ethics in AI. In particular, we will explore the Jigsaw Toxic Comment Classification Challenge. This challenge brought to light bias in the data that sparked the Jigsaw Unintended Bias in Toxicity Classification Challenge.

In this lab, we will dig into the dataset ourselves to explore the biases. We will further explore other datasets to expand our thinking about bias and fairness in AI in relation to aspects such as demography and equal opportunity as well as performance and group unawareness of the model. We will learn more about that in the tutorial below.

Task 1: README!¶

This week, coding activity will be minimal, if any. However, as always, you will be expected to incorporate your analysis, thoughts and discussions into your notebooks as markdown cells, so I recommend you start up your Jupyter notebook in advance. As always, remember:

  • To ensure you have all the necessary Python libraries/packages for running code you are recommended to use your environment set up on the Glasgow Anywhere Student Desktop.
  • Start Anaconda, and launch Jupyter Notebook from within Anaconda. If you run Jupyter Notebook without going through Anaconda, you might not have access to the packages installed on Anaconda.
  • If you run Anaconda or Jupyter Notebook on a local lab computer, there is no guarantee that these will work properly, that the packages will be available, or that you will have permission to install the extra packages yourself.
  • You can set up Anaconda on your own computer with the necessary libraries/packages. Please check how to set up a new environment in Anaconda and review the minimum list of Python libraries/packages, all discussed in the Week 4 lab.
  • We strongly recommend that you save your notebooks in the folder you made in the Week 1 exercise, which should have been created in the University of Glasgow One Drive - do not confuse this with personal and other organisational One Drives. Saving a copy of your notebooks on the University One Drive ensures that it is backed up (a first principle of digital preservation and information management).
  • When you are on the Remote desktop, the University of Glasgow One Drive should be visible in the home directory of the Jupyter Notebook. Other machines may require additional set up and/or navigation for One Drive to be directly accessible from Jupyter Notebook.

Task 2: Identifying Bias¶

This week we will make use of one of the Kaggle tutorials and their associated notebooks to learn how to identify different types of bias. Biases can creep in at any stage of the AI task: in data collection methods, in how we split/organise the test set, in the choice of algorithm, and in how the results are interpreted and deployed. Some of these topics have been extensively discussed and, in response, Kaggle has developed a course on AI ethics:

  • Navigate to the Kaggle tutorial on Identifying Bias in AI.
  • In this section we will explore the Jigsaw Toxic Comment Classification Challenge to discover different types of biases that might emerge in the dataset.

Task 2-a: Understanding the Scope of Bias¶

Read through the first page of the Kaggle tutorial on Identifying Bias in AI to understand the scope of biases discussed at Kaggle.

How many types of biases are described on the page?

  • There are six types of bias: historical, representation, measurement, aggregation, evaluation, and deployment.

Which type of bias did you know about already before this course and which type was new to you?

  • Before this course I didn't really know much about bias, just the basic definition. So most of them were new to me and it is interesting to learn more in depth about them.

Can you think of any others? Create a markdown cell below to discuss your thoughts on these questions.

  • Another one could possibly be implicit bias, which occurs when assumptions are made based on one's personal experiences that don't always apply generally.

Note that the biases discussed in the tutorial are not an exhaustive list. Recall that biases can exist across the entire machine learning pipeline.

  • Scroll down to the end of the Kaggle tutorial page and click on the link to the exercise to work directly with a model and explore the data.

Task 2-b: Run through the tutorial. Take selected screenshots of your activity while doing the tutorial.¶

  • Discuss with your peer group, your findings about the biases in the data, including types of biases.
  • Demonstrate your discussion with examples and screenshots of your activity on the tutorial. Present these in your own notebook.

Modify the markdown cell below to address the Tasks 2-a and 2-b.

Markdown for discussing bias

  1. If there is inherent bias in the input data, it's likely to show in the algorithm's output decisions.

  2. Sampling bias arises in the collection of data used to develop a machine learning model. Over- or under-sampling can occur, leading to output that is biased towards a particular demographic.

  3. Algorithm bias arises from the algorithm chosen to develop a machine learning model. There are many to choose from, such as linear regression or decision trees.
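
To make the sampling point concrete, one simple check is to compare how much of a dataset each group makes up and how often that group's examples carry the positive label. The sketch below is only an illustration: the group names, labels and counts are invented and are not taken from the Jigsaw data.

```python
from collections import Counter

# Hypothetical (group, label) records; label 1 = flagged as toxic.
# These values are invented purely for illustration.
records = [
    ("group_a", 0), ("group_a", 0), ("group_a", 1), ("group_a", 0),
    ("group_a", 0), ("group_a", 0), ("group_b", 1), ("group_b", 1),
]

def group_summary(rows):
    """Return {group: (share_of_dataset, positive_rate)}."""
    totals = Counter(group for group, _ in rows)
    positives = Counter(group for group, label in rows if label == 1)
    n = len(rows)
    return {
        group: (totals[group] / n, positives[group] / totals[group])
        for group in totals
    }

summary = group_summary(records)
for group, (share, positive_rate) in sorted(summary.items()):
    print(f"{group}: share={share:.2f}, positive_rate={positive_rate:.2f}")
```

In this toy sample, group_b is under-represented and every one of its examples is labelled positive, so a model trained on it could learn to associate the group itself with the label.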

Task 3: Large Language Models and Bias: Word Embedding Demo¶

Go to the embedding projector at tensorflow.org. This may take some time to load so be patient! There is a lot of information being visualised. This will take especially long if you select "Word2Vec All" as your dataset. The projector provides a visualisation of the language model called Word2Vec.

This tool also provides the option of visualising the organisation of hand written digits from the MNIST dataset to see how data representations of the digits are clustered together or not. There is also the option of visualising the iris dataset from scikit-learn with respect to their categories. Feel free to explore these as well if you like.

For the current exercise, we will concentrate on exploring the relationships between the words in the Word2Vec model. First, select Word2Vec 10K from the drop down menu (top lefthand side). This is a reduced version of Word2Vec All. You can search for words by submitting them in the search box on the right hand side.

Task 3.1: Initial exploration of words and relationships¶

  • Type apple and click on Isolate 101 points. This reduces the noise. Note how juice, fruit, wine are closer together than macintosh, computers and atari.
  • Try also words like silver and sound. What are your observations? Does it seem like words related to each other are sitting closer to each other?
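
The idea of related words "sitting closer" can be sketched in a few lines, assuming closeness is measured with cosine similarity (a common choice for word vectors). The three-dimensional vectors below are made up for illustration; real Word2Vec vectors have hundreds of dimensions and different values.

```python
import math

# Toy 3-dimensional "embeddings", invented for illustration only.
vectors = {
    "apple": [0.9, 0.8, 0.1],
    "juice": [0.8, 0.9, 0.0],
    "fruit": [0.9, 0.7, 0.2],
    "atari": [0.1, 0.2, 0.9],
    "computers": [0.2, 0.1, 0.9],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(word, k=2):
    """Return the k words whose vectors are most similar to `word`."""
    query = vectors[word]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scored[:k]]

neighbours = nearest("apple")
print(neighbours)
```

With these toy values, the food-related words come out as the nearest neighbours of apple, while the computing-related words score much lower.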

Task 3.2: Exploring "Word2Vec All" for patterns¶

  • Try to load the "Word2Vec All" dataset if you can (this may take a while so be patient!) and explore the word engineer, drummer or any other occupation - what do you find?
  • Do you think perhaps there are concerns of gender bias? If so, how? If not, why not? Discuss it with your peer group and present the results in your notebook.
  • Why not make some screenshots to embed into your notebook along with your comment? This could make it more understandable to a broader audience.
  • Do not forget to include attribution to the authors of the Projector demo.
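
One way to probe gender bias more systematically than by eye is to project occupation words onto the direction between the vectors for "he" and "she". This is not something the Projector does for you; it is a separate, commonly used technique, sketched here with invented two-dimensional vectors. The word "nurse" and all the numbers are hypothetical and do not reflect the real Word2Vec model.

```python
import math

# Toy 2-dimensional vectors, invented for illustration only.
toy = {
    "he": [1.0, 0.0],
    "she": [-1.0, 0.0],
    "engineer": [0.6, 0.8],
    "nurse": [-0.6, 0.8],
    "drummer": [0.0, 1.0],
}

def gender_score(word):
    """Cosine between a word vector and the (he - she) direction.

    Positive values lean towards 'he', negative towards 'she',
    and values near 0 are roughly neutral.
    """
    direction = [a - b for a, b in zip(toy["he"], toy["she"])]
    vec = toy[word]
    dot = sum(a * b for a, b in zip(vec, direction))
    norm_vec = math.sqrt(sum(a * a for a in vec))
    norm_dir = math.sqrt(sum(d * d for d in direction))
    return dot / (norm_vec * norm_dir)

for word in ("engineer", "nurse", "drummer"):
    print(word, round(gender_score(word), 2))
```

With real embeddings, you would load the trained vectors in place of the `toy` dictionary and compare the scores across many occupations.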

Modify the markdown cell below to present your thoughts.

Markdown cell for discussing large language models

  1. Word: Engineer. Count: 607. Related word examples: architect, technology, science, electrical and mechanical.

  2. Word: Drummer. Count: 274. Related word examples: musician, pianist, jazz, bands and percussion.

In [5]:
# Import only the helper needed to embed saved screenshots.
from IPython.display import Image
In [6]:
Image("engineer.png")
Out[6]:
In [ ]:
from IPython.display import Image
In [7]:
Image("drummer.png")
Out[7]:

Task 4: Thinking about AI Fairness¶

So we now know that AI models (e.g. large language models) can be biased. We saw that with the embedding projector already. In the previous exercise we discussed the machine learning pipeline and how the assessment of datasets can be crucial to deciding the suitability of deploying AI in the real world. This is where data connects to questions of fairness.

  • Navigate to the Kaggle Tutorial on AI Fairness.

Task 4-a: Topics in AI Fairness¶

Read through the page to understand the scope of the fairness criteria discussed at Kaggle. Just as we discussed with bias, the fairness criteria discussed at Kaggle are not exhaustive.

How many criteria are described on the page?

  • There are four fairness criteria: demographic parity, equal opportunity, equal accuracy and group unaware.

Which criteria did you know about already before this course and which, if any, was new to you?

  • Before this course I didn't really know much about fairness criteria, just the basic definition of fairness. So most of them were new to me and it is interesting to learn more in depth about them.

Can you think of any other criteria? Create a markdown cell and note down your discussion with your peer group on these questions.

  • AI systems can work differently for different individuals, so assessing and mitigating the resulting harms is important for increasing fairness.
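
Three of the four criteria named above can be checked directly from a model's predictions. The sketch below uses a small, invented set of outcomes for two hypothetical groups, A and B. It computes the selection rate (compared across groups for demographic parity), the true positive rate (equal opportunity) and the accuracy (equal accuracy). The fourth criterion, group unaware, concerns which inputs the model is given rather than its outputs, so it is not computed here. The sketch assumes every group has at least one truly positive case.

```python
# Hypothetical (group, true_label, predicted_label) triples; 1 = approved.
# All values are invented for illustration.
cases = [
    ("A", 1, 1), ("A", 1, 1), ("A", 0, 1), ("A", 0, 0),
    ("B", 1, 1), ("B", 1, 0), ("B", 0, 0), ("B", 0, 0),
]

def fairness_report(rows):
    """Return per-group selection rate, true positive rate and accuracy."""
    report = {}
    for group in sorted({g for g, _, _ in rows}):
        subset = [(y, p) for g, y, p in rows if g == group]
        positives = [(y, p) for y, p in subset if y == 1]
        report[group] = {
            # Demographic parity compares selection rates across groups.
            "selection_rate": sum(p for _, p in subset) / len(subset),
            # Equal opportunity compares true positive rates.
            "true_positive_rate": sum(p for _, p in positives) / len(positives),
            # Equal accuracy compares the share of correct predictions.
            "accuracy": sum(y == p for y, p in subset) / len(subset),
        }
    return report

report = fairness_report(cases)
for group, metrics in report.items():
    print(group, metrics)
```

In this toy example both groups have the same accuracy, yet their selection rates and true positive rates differ, which shows that a model can satisfy one criterion while failing the others.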

Task 4-b: AI fairness in the context of the credit card dataset.¶

Scroll down to the end of the page on AI fairness to find a link to another interactive exercise to run code in a notebook using credit card application data.

  • Run the tutorial, while taking selected screenshots.
  • Discuss your findings with your peer group.
  • Note down the key points of your activity and discussion in your notebook using the example and screenshots of your activity on the tutorial.

Report the results of the activity and discussion by modifying the markdown cell below.

Markdown cell for discussing fairness

  1. Varieties of Fairness Part One
In [ ]:
from IPython.display import Image
In [8]:
Image("varieties-of-fairness-1.png")
Out[8]:
In [ ]:
from IPython.display import Image
In [9]:
Image("model.png")
Out[9]:
  2. Understand the Baseline Model
In [11]:
from IPython.display import Image
In [12]:
Image("baseline-model.png")
Out[12]:
In [ ]:
from IPython.display import Image
In [13]:
Image("group-unaware-model.png")
Out[13]:
  3. Varieties of Fairness Part Two
In [ ]:
from IPython.display import Image
In [14]:
Image("varieties-of-fairness-2.png")
Out[14]:
In [ ]:
from IPython.display import Image
In [15]:
Image("evaluated-model.png")
Out[15]:
  4. Varieties of Fairness Part Three
In [ ]:
from IPython.display import Image
In [16]:
Image("final-model.png")
Out[16]:

Task 5: AI and Explainability¶

In this section we will explore the reasons behind decisions that AI makes. While this is really hard to know, some approaches have been developed to identify which features in your data (e.g. median_income in the housing dataset we used before) played a more important role than others in determining how your machine learning model performs. One of the many approaches for assessing feature importance is permutation importance.

The idea behind permutation importance is simple. Features are what you might consider the columns in a tabulated dataset, such as might be found in a spreadsheet.

  • The idea of permutation importance is that a feature is important if the performance of your AI program gets messed up by shuffling or permuting the order of values in that feature column for the entries in your test data.
  • The more your AI performance gets messed up in response to the shuffling, the more likely the feature was important for the AI model.
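
The two bullet points above can be sketched in plain Python. This is a minimal illustration, not the Kaggle implementation: the `model` function stands in for an already trained model and is hypothetical, as are the data. Performance is measured by mean absolute error, and the importance of a feature is taken to be the increase in that error after shuffling the feature's column.

```python
import random

def model(row):
    """Hypothetical 'trained' model: relies on feature 0, ignores feature 1."""
    return 3.0 * row[0] + 0.0 * row[1]

def mean_abs_error(rows, targets):
    return sum(abs(model(r) - t) for r, t in zip(rows, targets)) / len(rows)

def permutation_importance(rows, targets, column, rng):
    """Increase in error after shuffling the values of one feature column."""
    baseline = mean_abs_error(rows, targets)
    shuffled_values = [r[column] for r in rows]
    rng.shuffle(shuffled_values)
    # Rebuild each row with the shuffled value in the chosen column.
    permuted = [
        r[:column] + [v] + r[column + 1:]
        for r, v in zip(rows, shuffled_values)
    ]
    return mean_abs_error(permuted, targets) - baseline

rng = random.Random(0)
rows = [[float(i), float(rng.randint(0, 9))] for i in range(50)]
targets = [3.0 * r[0] for r in rows]  # targets depend only on feature 0

importance_0 = permutation_importance(rows, targets, 0, rng)
importance_1 = permutation_importance(rows, targets, 1, rng)
print(importance_0, importance_1)
```

Shuffling feature 0 damages the predictions, so it gets a positive importance. Shuffling feature 1 changes nothing, so its importance is zero. Library implementations usually repeat the shuffle several times and average the results to reduce the effect of chance.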

To make this idea more concrete, read through the page at the Tutorial on Permutation Importance at Kaggle. The page describes an example to "predict a person's height when they become 20 years old, using data that is available at age 10".

The page invites you to work with code to calculate the permutation importance of features for an example in football to predict "whether a soccer/football team will have the "Man of the Game" winner based on the team's statistics". Scroll down to the end of the page to the section "Your Turn" where you will find a link to an exercise to try it yourself to calculate the importance of features in a Taxi Fare Prediction dataset.

Task 5-a: Carry out the exercise, taking screenshots of the exercise as you make progress. Using screenshots and text in your notebook, answer the following question:¶

Exercise Results:

Question One Solution - It would be helpful to know whether New York City taxis vary prices based on how many passengers they have. Most places do not change fares based on numbers of passengers. If you assume New York City is the same, then only the top 4 features listed should matter. At first glance, it seems all of those should matter equally.

Question Two Solution

In [ ]:
from IPython.display import Image
In [18]:
Image("q2.png")
Out[18]:

Question Three Solution - There are several possible explanations:

  1. Travel might tend to have greater latitude distances than longitude distances. If the longitude values were generally closer together, shuffling them wouldn't matter as much.
  2. Different parts of the city might have different pricing rules (e.g. price per mile), and pricing rules could vary more by latitude than longitude.
  3. Tolls might be greater on roads going North<->South (changing latitude) than on roads going East<->West (changing longitude). Thus latitude would have a larger effect on the prediction because it captures the amount of the tolls.

Question Four Solution

In [ ]:
from IPython.display import Image
In [19]:
Image("q4.png")
Out[19]:

Question Five Solution - The scale of features does not affect permutation importance per se. The only reason that rescaling a feature would affect PI is indirectly, if rescaling helped or hurt the ability of the particular learning method we're using to make use of that feature. That won't happen with tree-based models, like the Random Forest used here. If you are familiar with Ridge Regression, you might be able to think of how that would be affected. That said, the absolute change features have high importance because they capture total distance traveled, which is the primary determinant of taxi fares. It is not an artifact of the feature magnitude.

Question Six Solution - We cannot tell from the permutation importance results whether traveling a fixed latitudinal distance is more or less expensive than traveling the same longitudinal distance. Possible reasons the latitude features are more important than the longitude features:

  1. Latitudinal distances in the dataset tend to be larger.
  2. It is more expensive to travel a fixed latitudinal distance.
  3. Both of the above.

If abs_lon_change values were very small, longitudes could be less important to the model even if the cost per mile of travel in that direction were high.
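
The last point can be illustrated with a small sketch. It assumes a hypothetical fare model that charges exactly the same amount per unit of latitude change and longitude change, and invented data in which latitude changes are spread over a much wider range. Because the model reproduces the fares exactly, the error after shuffling equals the average change in the prediction.

```python
import random

COST_PER_UNIT = 2.0  # hypothetical: identical cost in both directions

def fare_model(abs_lat_change, abs_lon_change):
    return COST_PER_UNIT * abs_lat_change + COST_PER_UNIT * abs_lon_change

def shuffle_error(lat, lon, column, rng):
    """Mean absolute change in prediction after shuffling one column."""
    original = [fare_model(a, b) for a, b in zip(lat, lon)]
    lat_copy, lon_copy = list(lat), list(lon)
    rng.shuffle(lat_copy if column == "lat" else lon_copy)
    permuted = [fare_model(a, b) for a, b in zip(lat_copy, lon_copy)]
    return sum(abs(p - o) for p, o in zip(permuted, original)) / len(original)

rng = random.Random(1)
lat = [rng.uniform(0.0, 10.0) for _ in range(200)]  # wide spread of values
lon = [rng.uniform(0.0, 1.0) for _ in range(200)]   # narrow spread of values

lat_importance = shuffle_error(lat, lon, "lat", rng)
lon_importance = shuffle_error(lat, lon, "lon", rng)
print(lat_importance, lon_importance)
```

Latitude comes out as more important even though the cost per unit is identical, purely because its values vary more. This is why permutation importance alone cannot tell us which direction is more expensive to travel in.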

Task 5-b: Reflecting on Permutation Importance.¶

  • Do you think the permutation importance is a reasonable measure of feature importance?

Permutation importance is a reasonable measure of feature importance in AI, as it evaluates the impact on a model's performance of randomly shuffling a feature's values.

  • Can you think of any examples where this would have issues?

One issue is that it may not capture interactions between features, as it assesses each feature in isolation. It can also be computationally expensive, especially for models with large numbers of features, as the model needs to be re-evaluated for each feature.

Task 6: Further Activities for Broader Discussion¶

Apart from the Jigsaw Toxic Comment Classification Challenge, another challenge you might explore is the Inclusive Images Challenge. Read at least one of the following.

  • The announcement of the Inclusive Images Challenge made by Google AI. Explore the Open Images Dataset V7 - this is where the Inclusive Images Challenge dataset comes from.
  • Article summarising the Inclusive Images Challenge at the NeurIPS 2018 conference
  • Explore the recent controversy about bias in relation to PULSE which, among other things, sharpens blurry images.
  • Given your exploration in the sections above, what problems might you foresee with these tasks attempted with the Jigsaw dataset on toxicity?

There are many concepts (e.g. model cards and datasheets) omitted from the discussion above about AI and ethics. To acquire a foundational knowledge of transparency, accessibility and fairness:

  • You are welcome to carry out the rest of the Kaggle course on Intro to AI Ethics to see some ideas from the Kaggle community.
  • You are welcome to carry out the rest of the Kaggle tutorial on explainability but these are a bit more technical in nature.

Summary¶

In this lab, you explored a number of areas that pose challenges with regard to AI and ethics: bias, fairness and explainability. These, and other topics in responsible AI development, are currently at the forefront of the AI landscape.

The discussions coming up in the lectures on applications of AI (to be presented by guest lecturers in the weeks to come) will undoubtedly intersect with these concerns. In preparation, you might think, in advance, about what distinctive questions about ethics might arise in AI applications in law, language, finance, archives, generative AI and beyond.
